plantR: An R package and workflow for managing species records from biological collections

Authors

Abstract

Biological collections (e.g. museums and herbaria) are essential for studying biodiversity (Graham et al., 2004). Taxonomists use these collections to describe new species and to produce taxonomic revisions and species checklists, among other important uses (Bebber et al., 2010; Besnard et al., 2018; Funk, 2003). In macroecology, biogeography and conservation, biological collections are often the main source of species records, which are used to study spatial patterns of biodiversity, ecological niches, endemism levels and conservation status (Dauby et al., 2017; Graham et al., 2004; Lima et al., 2020; Ulloa Ulloa et al., 2017). Biological collections are increasingly making their electronic databases available in online networks, such as the Global Biodiversity Information Facility (GBIF). This growing availability of information has catalysed many syntheses of our knowledge of biodiversity (e.g. Antonelli et al., 2018), further emphasising the importance of biological collections.

The increasing availability of information has also exposed the wide variation in documentation standards within and between collections (Willemse et al., 2008). Within collections, specimens collected by different people or in different periods may vary in their notation standards. The international standards themselves are constantly evolving (www.tdwg.org/standards). Moreover, older records tend to have less associated information (e.g. missing geographical coordinates) and may contain names of localities that no longer exist (i.e. changing toponyms). Between collections, differences emerge from choices of documentation standards, on how to enter specimen information in the databases, and on which fields should be entered first in the face of limited resources. Collection staff have little time to update the information that has been already entered or to correct data entry errors (e.g. typographical errors). These tasks become more challenging as the number of records in the collection increases.

Despite global efforts to standardise the information in biological collections (e.g. Darwin Core standards), there is still much variation in the information associated with records. This variation is likely to remain for years to come because collections are often underfunded, undervalued and understaffed (de Gasper et al., 2020). Online repositories, such as GBIF, gather, store, flag and check some but not all of the information given by the data providers. This means that, although highly valuable, the information available is not always ready for use (Peterson et al., 2018). So, final users (e.g. taxonomists, ecologists and conservationists) have to decide between performing those procedures or trusting the data without knowing exactly their quality.
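Records with problematic data quality can impact the outcomes of studies in taxonomy, ecology and conservation (Rodrigues et al., 2020; Zizka et al., 2019). Thus, we need comprehensive and reproducible tools to manage species records, particularly regarding species identifications, duplicate records and the fine-scale validation of geographical coordinates. We present plantR, an R package for managing species records from biological collections. As a general approach, plantR does not edit the original information; it stores the standardised information in new columns to assist collection curators in comparing the original and the edited information. Much of the functionalities of plantR depend on gazetteers, maps, and lists of plant names and taxonomists provided with the package. As its name suggests, plantR was initially designed for herbaria, with some functionalities being currently exclusive to plants. However, if the input data are in the required format, the package features should work for any group of organisms and any type of record (e.g. preserved specimens, human observations, etc.). plantR should be of interest to taxonomists, ecologists, biogeographers and conservationists, as well as to collection curators. plantR is implemented in R (R Core Team, 2020) and details on its implementation can be found at https://github.com/LimaRAF/plantR.

plantR is accompanied by the proposal of a workflow to process species records from biological collections (Figure 1). Here, we present the steps of this workflow and the plantR functions used to apply it. They are presented in an order that aims to maximise the standardisation of the information, but they can be used independently of the previous steps of the workflow.

Users can download species records directly from R. The Centro de Referência em Informação Ambiental (CRIA, www.cria.org.br) and GBIF (www.gbif.org) are the sources currently available, using the functions rspeciesLink() and rgbif2(), respectively. The function rgbif2() performs the search based on scientific names using the rgbif package. The search and output of rspeciesLink() are more flexible, allowing the user to search by, for example, collection, taxonomic level or locality. Since the two functions return different fields, their outputs are standardised to the Darwin Core (DwC) standard (function formatDwc()). Users can also load their own data or data from other sources, such as BIEN (https://bien.nceas.ucsb.edu/bien), both of which can be converted to the DwC standard (https://dwc.tdwg.org) using formatDwc(). Alternatively, users can import zipped DwC-Archive files from a local directory or from a download link (function readData()).

Standardisation is essential when combining records from multiple collections so that they follow the same standards. It also provides the basis for validating the locality information, assessing the confidence level of species identifications and searching for duplicate records across collections (see Section 3.3). The first edits performed regard the names of collectors and identifiers, the collector's number and the collection year (function formatOcc()). By default, people's names are returned in the Biodiversity Information Standards format (www.tdwg.org/standards/hispid3/), as follows: last name + comma + initials separated by points (e.g. Gentry, A.H.). This formatting is indicated for western names.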
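The name format mentioned above (last name, comma, initials separated by points) can be illustrated with a minimal base-R sketch. The helper `format_person_name()` is hypothetical and is not part of plantR; it assumes a single western-style name given as "First Middle Last" and deliberately ignores suffixes, prepositions, compound last names and titles.

```r
# Hypothetical helper (NOT plantR's formatOcc()): converts "First Middle Last"
# into "Last, F.M.". Assumes one western-style name per string.
format_person_name <- function(x) {
  # Split the trimmed name on runs of whitespace
  parts <- strsplit(trimws(x), "\\s+")[[1]]
  last <- parts[length(parts)]
  given <- parts[-length(parts)]
  # A single-word name is returned unchanged
  if (length(given) == 0) return(last)
  # Upper-case first letter of each given name, each followed by a point
  initials <- paste0(toupper(substr(given, 1, 1)), ".", collapse = "")
  paste0(last, ", ", initials)
}

format_person_name("Alwyn Howard Gentry")
```

In practice the package's own formatOcc() should be preferred, since real collector names contain many cases that this simplified sketch does not cover.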
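Name formatting takes into account generational suffixes (e.g. Junior), name prepositions (e.g. da, dos, von), compound last names (e.g. Saint-Hilaire), and titles (e.g. Dr., Profa.). plantR also standardises the collection codes based on a database of over 5,000 collection codes and their respective names from the Index Herbariorum and the Index Xylariorum (function getCode()).

One of the innovations of plantR is the standardisation of the records' locality information (fields ‘country’, ‘stateProvince’, ‘municipality’ and ‘locality’; function formatLoc()). For instance, country names are transformed into their English names (e.g. Brasil and Brésil to Brazil), and country codes into country names (e.g. BR and BRA to Brazil). In the case of missing locality information, plantR uses text mining aiming to retrieve it from other fields, if available. To make sure that the localities exist, plantR cross-checks them with a gazetteer (function getLoc()). The cross-checking is based on a standard name-string that hierarchically combines the locality information at the best resolution available, thus avoiding spurious matches between the names of countries, states/provinces and municipalities (function strLoc()). The default gazetteer contains entries at the country level and at the lowest administrative level available from GADM (https://gadm.org) for Latin American countries and dependent territories (e.g. U.S. Virgin Islands). For Brazil, the gazetteer also contains entries at the locality level (e.g. farms, forest fragments, parks). Most importantly, users can provide their own regional or personal gazetteers. The gazetteer includes the most common spelling variants and historical changes of locality names (currently biased towards Brazil and disregarding the temporal information needed to perform historic matching), which allows tracing records back to up-to-date locality names and improving the locality information (function getAdmin()).

Additionally, plantR assigns coordinates from the gazetteer to records without valid coordinates (function getCoord()), providing working coordinates at the best resolution available. Besides the automated assignment of coordinates, plantR formats the coordinates to obtain valid, non-zero coordinates in decimal degrees (function prepCoord()).

plantR offers tools to format species names (function fixSpecies()), such as the isolation and removal of infra-specific rank abbreviations (e.g. var., subsp.) and name modifiers (e.g. cf., aff.), and the flagging of names containing raw identifications (e.g. morpho-species, incomplete identifications). For botanical families, plantR checks the names against a list of family names and synonyms following APG IV for angiosperms (The Angiosperm Phylogeny Group, 2016) and PPG I for lycophytes and ferns (Schuettpelz et al., 2016; function prepFamily()). If the family name is not found in this list, the family is obtained based on the genus. Finally, plantR can check and replace synonyms and orthographic variants (function prepSpecies()). Currently, prepSpecies() uses the packages Taxonstand (Cayuela et al., 2021) and flora (Carvalho, 2020), which perform exact and fuzzy matching against The Plant List (www.theplantlist.org/) and the Brazilian Flora 2020 (http://floradobrasil.jbrj.gov.br/), respectively.

For the validation of localities, plantR compares the precision of the original locality information with the one obtained from the gazetteer (function validateLoc()). This comparison makes it possible to flag records with unknown or invalid place names, which users can drop from the analyses or double-check, depending on their goals. The validation of the geographical coordinates is done by overlapping them with maps (function checkCoord()).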
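plantR also provides functions for detecting the inversion and/or swap of coordinates (function checkInverted()), coordinates falling into the sea or bays but near the shoreline (checkShore()) and coordinates falling in neighbouring countries (checkBorders()). After the map matches, coordinates are flagged as validated, with an indication of the level of validation (i.e. country, state, or municipality levels). As before, the default maps are for the world and Latin America. But users can provide different maps using the arguments of checkCoord(), which must have the same format as the package's internal maps. plantR can also detect records from cultivated individuals (function getCult()) and spatial outliers (function checkOut()), that is, coordinates too far away from the core distribution of the taxon (Liu et al., 2018).

A highlight of plantR is the classification of the confidence level of species identifications according to a list of plant taxonomists (function validateTax()). This list contains c. 8,500 names of taxonomists compiled from different sources (Lima et al., 2020). The confidence level is the highest in three cases: (a) type specimens (e.g. isotypes, holotypes), (b) specimens identified by a family specialist (field ‘identifiedBy’), and, optionally, (c) specimens collected by a family specialist when the field ‘identifiedBy’ is empty. Records with an empty identifier field are classified as ‘unknown’, while records identified by non-family specialists are classified as ‘low’. Users can provide their own list of taxonomists as long as it has the same format as the one provided with plantR. validateTax() also returns the most frequent identifiers that are not in the taxonomist list.

Another novelty regards the search for duplicates, that is, specimens from the same collecting event incorporated in different collections (function validateDup()). Sharing material between collections is an encouraged practice, and duplicates can represent 25% of the records in some biotas. The search for duplicates is executed based on combinations of fields related to the taxon, collector and locality (e.g. collector name, number, date and municipality). Because of the great variation in the completeness of the information across records, the search is performed using simultaneous combinations of fields (function getDup()). If more than one combination is provided, plantR uses network analysis (i.e. graphs, with nodes and links) to find direct and indirect connections between records, which allows the retrieval of duplicates in relatively large datasets (i.e. millions of records). But finding all existing duplicates requires that the fields are complete and filled without typos (or with notations that are easy to standardise). This is rarely the case, so the duplicates found should not be considered as all existing cases. plantR can not only search for duplicates but also homogenise the information within the groups of duplicates found (function mergeDup()). This homogenisation is useful for retrieving missing information from duplicates with more completely digitised fields. After homogenisation, users can choose whether to remove the duplicates from the data. See Lima et al. (2020) for details on the duplicate search/merge procedures used here.

In the last step of the workflow, plantR provides functions to help summarise the data (e.g. number of occurrences, collections and species; function summaryData()) and the validation flags (e.g. localities, coordinates, identifications and duplicates; function summaryFlags()). It can also produce species checklists with user-defined numbers of voucher specimens and export records by groups (e.g. countries, collections).

The workflow can be executed in a few command lines using wrapper functions (see Table 1 for details). Below is a simple example for one species.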
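A more detailed tutorial can be found at https://github.com/LimaRAF/plantR.

    # Installing and loading the package
    remotes::install_github("LimaRAF/plantR")
    library("plantR")
    # Data download
    occs_splink <- rspeciesLink(species = "Euterpe edulis")
    occs_gbif <- rgbif2(species = "Euterpe edulis")
    occs <- formatDwc(splink_data = occs_splink, gbif_data = occs_gbif)
    # Data formatting
    occs <- formatOcc(occs)
    occs <- formatLoc(occs)
    occs <- formatCoord(occs)
    occs <- formatTax(occs)
    # Data validation
    occs <- validateLoc(occs)
    occs <- validateCoord(occs)
    occs <- validateTax(occs)
    occs <- validateDup(occs)
    # Data summary
    summs <- summaryData(occs)
    flags <- summaryFlags(occs)
    # Species checklist
    checklist <- checkList(occs)

Some of plantR's functions depend on other R packages (Table 1). Function rgbif2() uses rgbif (Chamberlain et al.) for data downloading. For the management of strings, country names and maps, plantR uses stringr (Wickham, 2019), countrycode (Arel-Bundock et al., 2018) and sf (Pebesma, 2018). As mentioned above, prepSpecies() uses Taxonstand and flora, and igraph (Csardi & Nepusz, 2006) is used for the duplicate search. plantR uses data.table (Dowle & Srinivasan) for fast table manipulation, reading and saving.

Other R packages perform synonym checks (e.g. Cayuela et al., 2021; Chamberlain & Szöcs, 2013; Kindt, 2020). To not ‘reinvent the wheel’, these packages were (or will be) integrated into plantR. CoordinateCleaner (Zizka et al., 2019) is a toolbox for the validation of geographical coordinates, which we suggest for more advanced coordinate editing. The differential of plantR lies in providing locality validation and the automatic validation of coordinates at the county level. These validations are based mainly on the gazetteer. The plantR approach to detect cultivated specimens (function getCult()), which is based on the fields ‘locality’ and ‘occurrenceRemarks’, is different from the one in CoordinateCleaner. The package naturaList (Rodrigues et al., 2020) also validates species identifications based on the field ‘identifiedBy’, with the identification being checked against a user-provided list of taxonomists. plantR relies on the provision of a list of taxonomists with the package, besides the possibility of adding extra names. In addition, plantR uses the field ‘typeStatus’. We are not aware of other R packages for the edition of people's names and the search for duplicates.

The variation in the notation of names and dates is huge; plantR handles many cases but not all of them. We envisage that the use of unique identifiers for collectors' and identifier's names (based on the recently created DwC terms ‘recordedByID’ and ‘identifiedByID’) will ease this task, but today double-checking is still necessary. The plantR gazetteer and county-level maps are currently biased towards Latin America. Improvements predicted for future versions include the download of records from other repositories (e.g. JABOT, http://jabot.jbrj.gov.br), the expansion of the gazetteer and maps (starting from tropical regions) and the check of names against taxonomic backbones with wider coverage (e.g. Catalogue of Life). We also plan to prepare the plantR output for use in other packages (e.g. modleR and ConR; Dauby et al., 2017; Sánchez-Tapia et al.), to facilitate the citation of data sources (e.g. occCite; Owens et al.) and to collect data provenance (e.g. rdt; Lerner et al.). plantR can certainly be improved; we are open to receiving suggestions and to incorporating them for a wider audience. Therefore, we foresee an increase in plantR coverage, with solutions for other regions and groups of organisms.

The number of records made available by biological collections has greatly increased in the last decades and it will probably continue to increase (Sweeney et al., 2018). So, having tools to assess the quality of these records is a pressing issue in biodiversity research. Without proper tools, this assessment is time-consuming.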
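Although similar tools exist, the greatest strength of plantR is to provide a user-friendly workflow, from beginning to end, in a single environment. We expect that plantR will increase the reproducibility of taxonomic, ecological and conservation studies. We also hope that plantR can help curators to find the records that need attention, saving time while conducting the task of maintaining their collections.

This research was supported by the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 795114. M.F.d.S., A.S.-T. and S.R.M. were supported by the Coordination for the Improvement of Higher Education Personnel (CAPES; process 88887.145924/2017-00), PNPD/CAPES and the PCI program of the ‘Instituto Nacional da Mata Atlântica’ (INMA), respectively. We thank Sidnei de Souza from CRIA for his help with the web API. We thank CNCFlora and TreeCo for the data used to construct the gazetteer, and Vinícius C. Souza (ESALQ/USP) who helped to curate the list of taxonomists. We are also thankful to the Harvard University Herbarium, the Brazilian Herbaria Network and the American Society of Plant Taxonomists, whose databases were consulted to build the current list of plant taxonomists. plantR received recognition in the 2021 Ebbe Nielsen Challenge.

The authors declare no conflict of interest.

R.A.F.d.L. conceived the idea; R.A.F.d.L., A.S.-T. and M.F.d.S. designed the methodology; [...] constructed the maps; [...] and H.t.S. wrote the documentations; [...] led the writing of the manuscript, with contributions from [...]. All authors contributed critically to the manuscript and gave final approval for publication.

The peer review history for this article is available at https://publons.com/publon/10.1111/2041-210X.13779.

plantR is freely available on GitHub (https://github.com/LimaRAF/plantR), and the version described in this paper (version 0.1.4) is archived on Zenodo at https://doi.org/10.5281/zenodo.5711723 (Lima et al., 2021).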



Journal

Journal title: Methods in Ecology and Evolution

Year: 2021

ISSN: 2041-210X

DOI: https://doi.org/10.1111/2041-210x.13779